Abstract

The use of community detection algorithms is explored within the framework 
of cover song identification, i.e. the automatic detection of different audio 
renditions of the same underlying musical piece. Until now, this task has 
been posed as a typical query-by-example task, where one submits a query 
song and the system retrieves a list of possible matches ranked by their 
similarity to the query. In this work, we propose a new approach which 
uses song communities (clusters, groups) to provide more relevant answers 
to a given query. Starting from the output of a state-of-the-art system, 
songs are embedded in a complex weighted network whose links represent 
similarity (related musical content). Communities inside the network are 
then recognized as groups of covers and this information is used to enhance 
the results of the system. In particular, we show that this approach increases 
both the coherence and the accuracy of the system. Furthermore, we provide 
insight into the internal organization of individual cover song communities, 
showing that there is a tendency for the original song to be central within 
the community. We postulate that the methods and results presented here 
could be relevant to other query-by-example tasks. 
1. Introduction



Audio cover song identification is the task of automatically detecting
which songs are versions of the same underlying musical piece using only
information extracted from their raw audio signal (Serra et al., 2010). This
addresses an important problem faced by modern society: the classification
and organization of digital information. More concretely, it addresses the
detection of near-duplicate musical documents (Casey et al., 2008).

Cover song identification is a challenging task, since cover songs might
differ from their originals in several musical aspects such as timbre, tempo,
song structure, main tonality, arrangement, lyrics, or language of the vocals
(Serra et al., 2010). Nevertheless, the identification of cover song versions
has been a very active area of study within the music information retrieval
(MIR) community over the last years (Serra et al., 2010; Casey et al., 2008;
Downie, 2008). Thanks to these efforts, and to the development of a number
of specific tools to extract and analyze musical information from audio
(Casey et al., 2008), we now have at our disposal a variety of metrics for
estimating the similarity between cover songs (Serra et al., 2010).

These metrics are commonly used to search for covers in a music
collection, ranking the relevance of each song to a given query. Indeed, cover
song identification has traditionally been set up as a typical information
retrieval (IR) task of query-by-example (Baeza-Yates and Ribeiro-Neto, 1999;
Manning et al., 2008), where the user submits a query (a song) and receives
an answer back (a list of songs ranked by their relevance to the query). In
the present article we propose a novel approach: after processing isolated
queries through query-by-example, systems may focus on groups of items,
with the new aim of identifying communities of songs within a given music
collection¹.

Using such a strategy has many intuitive advantages. Importantly, one
should bear in mind that these advantages are not specific to the cover
song detection task, and hold for any IR system operating through
query-by-example (Baeza-Yates and Ribeiro-Neto, 1999; Manning et al., 2008),
including analogous systems such as recommendation systems (Resnick and
Varian, 1997). First, given that current systems provide a suitable metric to
quantify the similarity between query items, several well-researched options
exist to exploit this information in order to detect inherent groups of items
(Xu and Wunsch II, 2009; Jain et al., 1999; Fortunato and Castellano, 2009;
Danon et al., 2005). Second, focusing on groups of items may help the system
in retrieving more coherent answers for isolated queries. In particular,
the answers to any query belonging to a given group would coherently contain
the other songs in the group, an advantage that is not guaranteed by
query-by-example systems alone. Third, music collections are usually
organized and structured on multiple scales. Thus we can infer and exploit
these regularities to increase the overall accuracy of traditional cover song
identification systems. Note that the two previous advantages specifically
aim to achieve higher user satisfaction and confidence in IR systems, as they
can be perceived as rational agents or assistants. Finally, once groups of
coherent items are correctly detected, one can study these groups in order
to retrieve new information, either from the individual communities or from
the relations between them.

¹ Throughout the manuscript we use the words group, set, community, and
cluster interchangeably.

In this article, to automatically identify cover song sets (or groups) in
a music collection, we employ a number of unsupervised grouping algorithms
on top of a state-of-the-art query-by-example system (Serra et al., 2009a).
We consider clustering algorithms (Xu and Wunsch II, 2009; Jain et al., 1999)
and, in particular, community detection algorithms (Fortunato and Castellano,
2009; Danon et al., 2005). The reader may easily see the resemblance between
the detection of cover song sets and a more classical community detection
task inside a complex network (Boccaletti et al., 2006; Costa et al., 2008).
In this setting, a set of nodes $\mathcal{N} = \{n_1, n_2, \ldots, n_N\}$
represents the recordings being analyzed, and the elements of the
$N \times N$ weight matrix W represent the distance (dissimilarity) between
any pair of nodes. Provided that the weights of this matrix are assigned with
the help of a suitable cover song dissimilarity metric (e.g. the same one
used to originally rank the answer to a query), communities inside this
complex network will represent sets of recordings with related musical
content. Although complex networks and community detection algorithms have
been used in many problems involving complex systems (Boccaletti et al.,
2006; Costa et al., 2008), and more specifically in studying musical networks
(Buldu et al., 2007; Teitelbaum et al., 2008; Cano et al., 2006), to the best
of our knowledge they have never been applied in the context of a retrieval
task before. The only exception is our previous work (Serra et al., 2009b),
of which the present article provides considerable extensions, improvements,
and new results. An alternative technique for improving cover song retrieval
was considered in (Lagrange and Serra, 2010).

We now provide a brief overview of our main contributions and, at the
same time, outline the remaining structure of the article. We first build and
analyze a cover song network (Sec. 2). To this end we apply a
state-of-the-art algorithm for cover song similarity to an in-house music
collection. We then analyze this network, both its topology and the
characteristics of the percolation process. Within this analysis we find a
strong modular structure, with well-defined communities and a clustering
coefficient higher than expected in an equivalent random network. This
confirms our intuitive reasoning that cover songs naturally cluster into
cover song sets. With this knowledge we can then safely proceed to detect the
actual sets of covers based on the output of the state-of-the-art algorithm
(Sec. 3). For that, several clustering and community detection strategies are
compared. Four of these strategies are based on community detection in
complex networks, three of which are novel contributions. The computation
time of all the considered methods is also assessed. Next, we show how
query-by-example results can be improved by incorporating the information
obtained through the group detection stage into the system (Sec. 4). Indeed,
our results show a consistent increase in the accuracy of the system, with
particularly promising values for community detection methods. This confirms
our intuitive reasoning that exploiting the regularities found in the answers
given by a query-by-example system can lead to an overall accuracy increase.
Finally, we focus on the internal organization of cover song sets. More
concretely, we present a pioneering study of the role that original songs
(i.e. the ones performed by the original author or artist) play within a
group of covers (Sec. 5). To the authors' knowledge, the present study is the
first attempt in this direction. In particular, we show that there is a
tendency for the original song to be central within the community. A short
conclusions section closes the article (Sec. 6).



2. Cover song networks 

2.1. Building the network 

The first step required by our proposal is to create a network and to
embed nodes (songs) into it. We use an in-house music collection of 2125
songs comprising a variety of genres and styles. This collection is an
extension of the one used by Serra et al. (2009a), to which we refer for
further details, and consists of 523 non-overlapping groups of cover songs,
each group having an identifying label which we use in the evaluation stages.
The cardinality of these groups, i.e. the number of songs per group, varies
between 2 and 18, with an expected value of 4.

Links between network nodes should represent the cover song relationship
between corresponding musical pieces (the dissimilarity between their musical
content). Therefore, an algorithm to compute this dissimilarity is needed in
order to calculate the elements $w_{i,j}$ of the matrix W for each pair of
nodes $n_i$ and $n_j$. Several alternatives for such dissimilarity measures
have been proposed in the literature (Serra et al., 2010). In particular, we
use the Qmax measure presented by Serra et al. (2009a). This measure makes it
possible to account for all the potential differences between cover songs of
the same underlying musical piece (Sec. 1). However, in spite of being one of
the most promising strategies proposed so far, its accuracy is not perfect.
This is a further motivation to improve the accuracy of the system through a
post-processing step based on cover set detection.

A brief outline of the Qmax measure follows. First, a time series of
musical descriptors is extracted for all songs. In the case of cover songs,
tonal similarity is commonly exploited (Serra et al., 2010). In particular,
Qmax employs time series of pitch class profiles (PCP; Gomez, 2006). PCP
features estimate the amount of energy for each musical note of the Western
musical scale that is present in a short analysis frame of the raw audio
signal. This analysis is performed in a moving window, leading to a time
series that is robust against non-tonal components (e.g. ambient noise or
percussive sounds), and independent of timbre and the specific instruments
used. Furthermore, PCPs are independent of a musical piece's loudness and
volume fluctuations. As cover versions may be played in different tonalities
(e.g. to be adapted to the characteristics of a particular singer or
instrument) one has to tackle differences in the main key of the song. This
can be effectively done through various strategies (Serra et al., 2010).

From the above PCP time series, one forms a state space representation
for each song using delay coordinates (Kantz and Schreiber, 2004). These
representations are then compared on a pairwise basis through a cross
recurrence plot (CRP), which is the bivariate generalization of classical
recurrence plots (Eckmann et al., 1987; Marwan et al., 2007). Finally, the
Qmax measure is used to extract features that are sensitive to cover song CRP
characteristics. This measure was derived from a previously published RQA
measure [Lmax; Eckmann et al. (1987)], but adapted to the problem at hand by
allowing it to track curved and potentially disrupted traces in a CRP.
Despite this adaptation, in Serra et al. (2009a) we showed that the Qmax
measure is not restricted to MIR nor to the particular application of cover
song identification.

An example of the abovementioned process is shown in Figs. 1 and 2,
which compare the song "Rock around the clock" as performed by Elvis
Presley versus a version performed by The Sex Pistols. Since it is not the
objective of this article to thoroughly present the Qmax measure, the
interested reader is referred to Serra et al. (2009a) for further details. A
comprehensive overview of cover song similarity measures can be found in
Serra et al. (2010).

The symmetric measure Qmax represents similarity: the higher the value,
the more similar both analyzed recordings are in terms of their tonal musical
content. To fill the weighted adjacency matrix W of the network, we proceed
as in Serra et al. (2009c) and convert Qmax to a dissimilarity value by taking

    $w_{i,j} = \frac{\sqrt{|s_j|}}{Q_{\max}(s_i, s_j)}$,    (1)

where $|s_j|$ is proportional to the duration of song $s_j$ and
$Q_{\max}(s_i, s_j) \in [1, \max(|s_i|, |s_j|)]$. Notice that
$w_{i,j} = w_{j,i}$ iff $s_i$ and $s_j$ have the same duration. Recall that
the nodes of the network $\mathcal{N} = \{n_1, n_2, \ldots, n_N\}$ represent
the N recordings $s_i$ being analyzed.



2.2. Analysis of the network 

The result of the previous procedure over the available data is a weighted
directed graph expressing cover song relationships. This resulting network
is represented in Fig. 3. A threshold has been applied so that only pairs of
nodes with $w_{i,j} < 0.2$ are drawn. Some clusters, that is, sets of covers,
are already visible, especially in the external zones of the network.

Figure 1: CRP for the song "Rock around the clock" as performed by Elvis
Presley (x axis) and The Sex Pistols (y axis). Axes represent time and black
dots represent correspondences between the tonal content of both songs. We
see quite long black traces through the CRP, which are usually not straight
diagonals but curved and disrupted ones, indicating similarly evolving
temporal patterns in both song representations.

In order to understand how the network evolves when the threshold is
modified, we represent six different classical network metrics as a function
of the threshold (Fig. 4). These metrics correspond to (Boccaletti et al.,
2006): graph density, number of independent components, size of the strong
giant component, number of isolated nodes, efficiency (Latora and Marchiori,
2001), and clustering coefficient. In the same plots, we also display the
values for the last five measures as expected in random networks with the
same number of nodes and links.

Figure 2: The Q matrix (Serra et al., 2009a) for the same pair of songs. This
matrix quantifies the lengths of the previously mentioned black traces. The
Qmax measure corresponds to the maximum value in Q (in this example
Qmax = 46.6).

By looking at the evolution of these metrics, we can infer some
interesting knowledge about the network and its inherent structure. Notice
that when reducing the threshold (and therefore increasing the number of
deleted links), the network splits into a higher number of clusters than
expected (Fig. 4, top right), which represents the formation of cover song
communities. This process begins around a threshold of 0.5 (see, for
instance, the evolution of the size of the strong giant component). When
these communities are formed, they maintain a high clustering coefficient and
a high triangular coherence (bottom right graph of Fig. 4, between 0.3 and
0.5), i.e. sub-networks of covers tend to be fully connected. It is also
interesting to note that the number of isolated nodes remains lower than
expected, except for high thresholds (Fig. 4, middle right). This suggests
that most of the songs are connected to some cluster while a small group of
them are different, with unique musical features. We found nearly identical
results using a symmetric dissimilarity matrix with entries
$(w_{i,j} + w_{j,i})/2$.
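
The threshold analysis above can be illustrated with a small sketch. It is not the code used for the experiments: the toy matrix `W`, the function names, and the choice to ignore link direction when counting components are all assumptions made for illustration. It thresholds a dissimilarity matrix, builds the symmetric variant $(w_{i,j} + w_{j,i})/2$, and reports graph density, number of components, and number of isolated nodes.

```python
def threshold_graph(W, w_th):
    """Directed adjacency sets: link i -> j is kept when W[i][j] < w_th."""
    n = len(W)
    return [{j for j in range(n) if j != i and W[i][j] < w_th}
            for i in range(n)]


def symmetrize(W):
    """Symmetric variant with entries (w_ij + w_ji) / 2."""
    n = len(W)
    return [[(W[i][j] + W[j][i]) / 2 for j in range(n)] for i in range(n)]


def density(adj):
    """Fraction of the n * (n - 1) possible directed links that are present."""
    n = len(adj)
    return sum(len(a) for a in adj) / (n * (n - 1))


def components(adj):
    """Connected components, ignoring link direction (a simplification)."""
    n = len(adj)
    und = [set(a) for a in adj]
    for i, neighbours in enumerate(adj):
        for j in neighbours:
            und[j].add(i)
    seen, comps = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in und[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        comps.append(sorted(comp))
    return comps


# Hypothetical 4-song dissimilarity matrix (not taken from the collection).
W = [[0.00, 0.10, 0.90, 0.80],
     [0.15, 0.00, 0.70, 0.90],
     [0.90, 0.80, 0.00, 0.60],
     [0.70, 0.90, 0.50, 0.00]]

for w_th in (0.2, 0.55):
    adj = threshold_graph(W, w_th)
    comps = components(adj)
    isolated = sum(1 for c in comps if len(c) == 1)
    print(w_th, round(density(adj), 3), len(comps), isolated)
```

Sweeping `w_th` over a grid and plotting the three quantities gives curves of the kind described for Fig. 4; efficiency and the clustering coefficient are omitted here to keep the sketch short.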

3. Detecting groups of covers 

We assess the detection of cover sets (or communities) by evaluating a
number of unsupervised methods either based on clustering or on complex
networks. Three of these are novel approaches. Since standard implementations
of clustering algorithms do not operate with an asymmetric dissimilarity
measure, in this section and in the subsequent one we use the symmetric
dissimilarity matrix explained above.

Figure 3: Graphical representation of the cover song network when a threshold
of 0.2 is applied. Original songs are drawn in blue, while covers are in
black. In Sec. 5 the role of original songs inside each community will be
further studied.

3.1. Methods 

Figure 4: (Black solid lines) Evolution of six metrics of the network as a
function of the threshold. These metrics are, from top left to bottom right:
graph density, number of independent components, size of the strong giant
component, number of isolated nodes, efficiency, and clustering coefficient.
(Red dashed lines) Expected value in a random network with the same number of
nodes and links. Nearly identical figures were obtained when considering a
symmetric dissimilarity matrix (see text).

K-medoids: K-medoids (KM) is a classical technique to group a set of
    objects into a previously known number K of clusters. This algorithm is
    a common choice when the computation of means is unavailable (as it
    solely operates on pairwise distances) and can exhibit some advantages
    compared to the standard K-means algorithm (Xu and Wunsch II, 2009),
    in particular with noisy samples. The main drawback for its application
    is that, as with the K-means algorithm, K, the number of expected
    clusters, must be set in advance. However, several heuristics can be
    used for that purpose. We employ the K-medoids implementation of the
    TAMO package², which incorporates several heuristics to achieve an
    optimal K value. We use the default parameters and try all possible
    heuristics provided in the implementation.

Hierarchical clustering: Four representative agglomerative hierarchical
    clustering methods have been tested (Xu and Wunsch II, 2009; Jain et
    al., 1999): single linkage (SL), complete linkage (CL), group average
    linkage (UPGMA), and weighted average linkage (WPGMA). We use the
    hcluster implementation³ with the default parameters, and we try
    different cluster validity criteria such as checking descendants for
    inconsistent values, or considering the maximal or the average
    inter-cluster cophenetic distance. Thus, in the end, all clustering
    algorithms rely only on the definition of a distance threshold
    $d_{Th}$, which is set experimentally.

Modularity optimization: This method (MO), as well as the next three
    algorithms, is designed to exploit a complex network collaborative
    approach. MO extracts the community structure from large networks
    based on the optimization of the network modularity (Fortunato and
    Castellano, 2009; Danon et al., 2005). In particular, we use the method
    proposed by Blondel et al. (2008) with the implementation by Aynaud⁴.
    This method is reported to outperform all other known community
    detection algorithms in terms of computational time while still
    maintaining a high accuracy.

² http://fraenkel.mit.edu/TAMO
³ http://code.google.com/p/scipy-cluster
⁴ http://perso.crans.org/~aynaud/communities/index.html

Proposed method 1: Our first proposed method (PM1) applies a threshold
    to each network link in order to create an unweighted network where
    two nodes are connected only if their weight (dissimilarity) is less than
    a certain value $w_{Th}$. In addition, for each row of W, we only allow
    a maximum number of connections, considering only the lowest values
    of the thresholded row as valid links. That is, we only consider the
    first $r_{Th}$ nearest neighbors for each node (values $w_{Th}$ and
    $r_{Th}$ are set experimentally). Finally, each connected component is
    assigned to be a group of covers. Although this is a very naive
    approach, it will be shown that, given the considered network and
    dissimilarity measure, it achieves a high accuracy level at low
    computational cost.
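
The three steps of proposed method 1 (threshold, nearest-neighbor limit, connected components) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function name `pm1_cover_sets`, the toy matrix, and the decision to treat every retained link as undirected are ours.

```python
def pm1_cover_sets(W, w_th, r_th):
    """Group songs as the connected components of a thresholded k-NN graph.

    W    : square dissimilarity matrix (list of lists)
    w_th : a link i-j is a candidate only if W[i][j] < w_th
    r_th : at most r_th nearest candidates are kept per row
    """
    n = len(W)
    adj = [set() for _ in range(n)]
    for i in range(n):
        # Candidates below the threshold, smallest dissimilarity first.
        below = sorted((W[i][j], j) for j in range(n)
                       if j != i and W[i][j] < w_th)
        for _, j in below[:r_th]:
            # Assumption: a retained link is treated as undirected.
            adj[i].add(j)
            adj[j].add(i)

    # Each connected component becomes one group of covers.
    seen, groups = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:
            u = stack.pop()
            group.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        groups.append(sorted(group))
    return groups


# Hypothetical symmetric dissimilarities for five songs.
W = [[0.00, 0.10, 0.15, 0.90, 0.90],
     [0.10, 0.00, 0.30, 0.90, 0.90],
     [0.15, 0.30, 0.00, 0.90, 0.90],
     [0.90, 0.90, 0.90, 0.00, 0.20],
     [0.90, 0.90, 0.90, 0.20, 0.00]]

print(pm1_cover_sets(W, w_th=0.25, r_th=1))
```

Because only the nearest neighbors of each node are inspected, the procedure never needs the full thresholded graph in memory, which is one reason for its low computational cost.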

Proposed method 2: The previous approach could be further improved by
    reinforcing triangular connections in the complex network before the
    last step of checking for connected components. In other words,
    proposed method 2 (PM2) tries to reduce the "uncertainty" generated by
    triplets of nodes connected by two edges and to reinforce coherence in
    a triangular sense.

    This idea can be illustrated by the following example (Fig. 5). Suppose
    that three nodes in the network, e.g. $n_i$, $n_j$, and $n_k$, are
    covers: the resulting subnetwork should be triangular, so that every
    node is connected with the two remaining ones. On the other hand, if
    $n_i$, $n_j$, and $n_k$ are not covers, no edge should exist between
    them. If the pairs $n_i, n_j$ and $n_i, n_k$ are each connected
    (Fig. 5A), we can induce more coherence by either deleting one of the
    existing edges (Fig. 5B), or by creating a connection between $n_j$ and
    $n_k$ (i.e. forcing the existence of a triangle, Fig. 5C). This
    coherence can be measured through an objective function $f_o$ which
    considers complete and incomplete triangles in the whole graph. We
    define $f_o$ as a weighted difference between the number of complete
    triangles $N_\triangle$ and the number of incomplete triangles $N_\vee$
    (three vertices connected by only two links) that can be computed from
    a pair of vertices:
    $f_o(N_\triangle, N_\vee) = N_\triangle - \alpha N_\vee$. The constant
    $\alpha$, which weights the penalization for having incomplete
    triangles, is set experimentally.

    The implementation of this idea sequentially analyzes each pair of
    vertices $n_i, n_j$ by calculating the value of $f_o$ for two
    situations: (i) when an edge between $n_i$ and $n_j$ is artificially
    created and (ii) when such an edge is deleted. Then, the option which
    maximizes $f_o$ is kept and the adjacency matrix is updated as
    necessary. The process of assigning cover sets is the same as in PM1.
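
A minimal sketch of the triangle-reinforcement step follows. It rests on our own reading of the description, so several points are assumptions: triangles are counted only among triplets that contain the pair under analysis, ties between the two options leave the graph unchanged, and the names `triangle_counts`, `f_o`, and `reinforce_triangles` as well as the value of `alpha` are illustrative.

```python
def triangle_counts(adj, i, j, edge_present):
    """Count complete and incomplete triangles through the pair (i, j).

    edge_present is the hypothetical state of the link i-j; the current
    state of that link in adj is deliberately ignored.
    """
    complete = incomplete = 0
    for k in range(len(adj)):
        if k == i or k == j:
            continue
        links = (k in adj[i]) + (k in adj[j]) + edge_present
        if links == 3:
            complete += 1
        elif links == 2:
            incomplete += 1
    return complete, incomplete


def f_o(n_complete, n_incomplete, alpha):
    """Weighted difference between complete and incomplete triangles."""
    return n_complete - alpha * n_incomplete


def reinforce_triangles(adj, alpha):
    """Sequentially create or delete each link, whichever maximizes f_o."""
    n = len(adj)
    for i in range(n):
        for j in range(i + 1, n):
            with_edge = f_o(*triangle_counts(adj, i, j, True), alpha)
            without_edge = f_o(*triangle_counts(adj, i, j, False), alpha)
            if with_edge > without_edge:
                adj[i].add(j)
                adj[j].add(i)
            elif without_edge > with_edge:
                adj[i].discard(j)
                adj[j].discard(i)
            # Assumption: on a tie the current state is kept.
    return adj


# A 4-node clique missing the link 2-3: the missing link is created.
almost_clique = [{1, 2, 3}, {0, 2, 3}, {0, 1}, {0, 1}]
print(reinforce_triangles(almost_clique, alpha=0.5))
```

Since every pair of vertices is visited and each visit scans all remaining nodes, this sketch costs on the order of $N^3$ operations, which is consistent with the method being slow on large collections.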

Figure 5: Example of the process of reinforcing the triangular coherence of
the network. The sub-network in the left part (A) can be improved by either
deleting a link (B), or by adding a third link between the two nodes that
were not originally connected (C).

Proposed method 3: The computation time of the previous method can
    be substantially reduced by considering for the computation of $f_o$
    only those vertices whose connections seem to be uncertain. This is
    what proposed method 3 (PM3) does: if the dissimilarity between two
    songs is extremely high or low, this means that the cover song
    identification system has clearly detected a match or a mismatch.
    Accordingly, we only consider for $f_o$ the pairs of vertices whose
    edge weight is close to $w_{Th}$ (a closeness margin is empirically
    set).
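
The pair selection of proposed method 3 can be sketched in a few lines. The function name `uncertain_pairs`, the symmetric margin around the threshold, and the toy values are assumptions for illustration; the source only states that a closeness margin is set empirically.

```python
def uncertain_pairs(W, w_th, margin):
    """Pairs (i, j), i < j, whose dissimilarity lies within margin of w_th."""
    n = len(W)
    return [(i, j)
            for i in range(n)
            for j in range(i + 1, n)
            if abs(W[i][j] - w_th) <= margin]


# Hypothetical symmetric dissimilarities for three songs.
W = [[0.00, 0.10, 0.22],
     [0.10, 0.00, 0.90],
     [0.22, 0.90, 0.00]]

# Only the pair (0, 2) is close enough to the threshold to be re-examined.
print(uncertain_pairs(W, w_th=0.2, margin=0.05))
```

Only the returned pairs would then be passed to the $f_o$ comparison, so the cost depends on the number of uncertain links rather than on all pairs of songs.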



3.2. Evaluation methodology 

The experimental setup is an important aspect to be considered when
evaluating cover song identification systems. Each setup is defined by
different parameters (Serra et al., 2009b): the total number of songs N, the
number of cover sets $N_C$ the collection includes, the cardinality C of the
cover sets (i.e. the number of songs in the set), and the number of added
noise songs $N_N$ (i.e. songs that do not belong to any cover set, which are
included to add difficulty to the task). Because some setups can lead to
wrong accuracy estimations (Serra et al., 2010), it is safer to consider
several of them, including fixed and variable cardinalities. In our
experiments we use the setups summarized in Table 1. The whole network
analyzed in Sec. 2.2 corresponds to setup 3. For other setups we randomly
sample cover sets from setup 3 and repeat the experiments $N_T$ times. We
either sample cover sets with a fixed cardinality (C = 4, the expected
cardinality of setup 3) or without fixing it (variable cardinality,
C = $\nu$). For sampled setups, average accuracies are reported.

To quantitatively evaluate cover set (or community) detection we use
the classical F-measure with even weighting (Baeza-Yates and Ribeiro-Neto,
1999; Manning et al., 2008)

    $F = \frac{2PR}{P + R}$,    (2)

which goes from 0 (worst case) to 1 (best case). In Eq. (2), P and R
correspond to precision and recall, respectively. For our evaluation, we
compute these two quantities independently for all songs and average
afterwards, i.e. unlike other clustering evaluation measures (Sahoo et al.,
2006), F is not computed on a per-cluster basis, but on a per-song basis.
This way, and in contrast to the typical clustering F-measure or other
clustering evaluation measures like Purity, Entropy, or F-Score (Sahoo et
al., 2006; Zhao and Karypis, 2002), we do not have to blindly choose which
cluster is the representative for a given cover set.

Table 1: Experimental setup summary. The ⟨·⟩ delimiters denote expected value.

For each song $s_i$, we count the number of true positives $T_i^+$ (i.e. the
number of actual cover songs of $s_i$ estimated to belong to the same
community as $s_i$), the number of false positives $F_i^+$ (i.e. the number
of songs estimated to belong to the same group as $s_i$ that are actually not
covers of $s_i$) and the number of false negatives $F_i^-$ (i.e. the number
of actual covers of $s_i$ that are not detected to belong to the same group
as $s_i$). Then we define

    $P_i = \frac{T_i^+}{T_i^+ + F_i^+}$    (3)

and

    $R_i = \frac{T_i^+}{T_i^+ + F_i^-}$.    (4)

These two quantities [Eqs. (3) and (4)] are averaged across all songs
(i = 1, ..., N) to obtain P and R, respectively.
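
The per-song evaluation can be made concrete with a short sketch. The function name, the label encoding, and the convention of returning 0 when a denominator is empty are assumptions; the source does not state how such cases are handled.

```python
def per_song_f_measure(true_labels, predicted_labels):
    """Average per-song precision and recall, then combine them into F."""
    n = len(true_labels)
    precisions, recalls = [], []
    for i in range(n):
        tp = fp = fn = 0
        for j in range(n):
            if j == i:
                continue
            same_true = true_labels[j] == true_labels[i]
            same_pred = predicted_labels[j] == predicted_labels[i]
            if same_true and same_pred:
                tp += 1          # actual cover placed in the same group
            elif same_pred:
                fp += 1          # same group, but not a cover
            elif same_true:
                fn += 1          # cover left out of the group
        # Assumption: empty denominators count as 0.
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    P = sum(precisions) / n
    R = sum(recalls) / n
    F = 2 * P * R / (P + R) if P + R else 0.0
    return P, R, F


# Two true cover sets ("a" and "b"); the third song is put in the wrong group.
truth = ["a", "a", "a", "b", "b"]
found = [0, 0, 1, 1, 1]
print(per_song_f_measure(truth, found))
```

Because every song is scored against the group it was assigned to, no mapping between detected clusters and true cover sets has to be chosen.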



Algorithm   Setup
            2.1     2.2     2.3     2.4     3

KM          0.66    0.66    0.68    0.69    n.c.
SL          0.79    0.81    0.88    0.89    0.78
CL          0.81    0.82    0.83    0.83    0.79
UPGMA       0.82    0.83    0.83    0.83    0.79
WPGMA       0.83    0.84    0.84    0.84    0.82
MO          0.80    0.83    0.89    0.89    0.81
PM1         0.81    0.83    0.88    0.89    0.81
PM2         0.77    0.77    n.c.    n.c.    n.c.
PM3         0.79    0.79    0.87    0.88    0.76

Table 2: Accuracy F for the considered algorithms and setups (see Table 1 for
the details on the different setups). Due to the algorithms' complexity, some
results were not computed (denoted as n.c.).

3.3. Results 

To assess the algorithms' accuracy we independently optimized all possible
parameters for each algorithm. This optimization was done in-sample by
a grid search, trying to maximize F on the randomly chosen songs of setups
1.1 to 1.4. Within this optimization phase, we saw that the definition of a
threshold (either $d_{Th}$ for clustering algorithms or $w_{Th}$ for
community detection algorithms) was, in general, the only critical parameter
for all algorithms (for our proposed methods we used $r_{Th}$ between 1 and
3). All other parameters turned out not to be critical for obtaining
near-optimal accuracies. Methods that had especially broad ranges of these
near-optimal accuracies were KM, PM2, and all considered hierarchical
clustering algorithms.

We report the out-of-sample accuracies F for setups 2.1 to 3 in Table 2.
Overall, the high F values obtained (above 0.8 in the majority of the cases,
some of them nearly reaching 0.9) indicate that the considered approaches are
able to effectively detect groups of cover songs. This makes it possible to
reinforce the coherence within answers and to enhance the answer of a
query-based retrieval system (see Sec. 4). In particular, we see that
accuracies for PM1 and PM3 are comparable to the ones achieved by the other
algorithms and, in some setups, even better. We also see that KM and PM2
perform worst.
3.4. Computation time

In the application of these techniques to big real-world music collections,
computational complexity is of great importance. To qualitatively evaluate
this aspect, we report the average amount of time spent by the algorithms
to achieve a solution for each setup (Fig. 6). We see that KM and PM2 are
completely inadequate for processing collections with more than 2000 songs
(e.g. setup 3). The steep rise in the time spent by hierarchical clustering
algorithms to find a cluster solution for setup 3 also raises some doubts as
to the usefulness of these algorithms for huge music collections
[$O(N^2 \log N)$; Jain et al. (1999)]. Furthermore, hierarchical clustering
algorithms, as well as the KM algorithm, take the full pairwise dissimilarity
matrix as input. Therefore, with a music collection of, say, 10 million
songs, this distance matrix might be difficult to handle.

In contrast, algorithms based on complex networks show a better
performance (with the aforementioned exception of PM2). More specifically,
MO, PM1, and PM3 use local information (the nearest neighbors of the
queries), while PM3 furthermore acts on a small subset of the links. It
should also be noticed that the resulting network is very sparse, i.e. the
number of links is much lower than $N^2$ (Boccaletti et al., 2006) and,
therefore, calculations on such graphs can be strongly optimized both in
memory requirements and computational costs [as demonstrated, for instance,
by Blondel et al. (2008), who have applied their method to networks of
millions of nodes and links].

4. Improving the accuracy through community detection 

In this section we investigate the use of the information obtained through 
the detection of communities to increase the overall accuracy of a query 
system. 

4.1. Method 

Given the dissimilarity matrix W and a solution for the cluster or
community detection problem, one can calculate a refined dissimilarity matrix
W′ by setting

    $w'_{i,j} = \frac{w_{i,j}}{\max(W)} + \beta_{i,j}$,    (5)

where $\beta_{i,j} = 0$ if $s_i$ and $s_j$ are estimated to be in the same
community and $\beta_{i,j} = c$ otherwise. To ensure that songs in the same
community have $w'_{i,j} \le 1$ and others have $w'_{i,j} > 1$, we use a
constant c > 1. This refined matrix W′ can be used again to rank query
answers according to cover song similarity and consequently, when compared to
the initial W of the original system, to evaluate the accuracy increase
obtained.
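
A minimal sketch of this refinement step is given below. It assumes that the first term of Eq. (5) is the original dissimilarity divided by the largest entry of W, which is what makes same-community values fall at or below 1; the function names, the value c = 2, and the toy data are illustrative only.

```python
def refine_dissimilarities(W, community, c=2.0):
    """Add a penalty c to every pair that is not in the same community.

    Assumption: dissimilarities are first scaled by the largest entry of W,
    so that same-community pairs end up at or below 1.
    """
    n = len(W)
    w_max = max(max(row) for row in W)
    return [[W[i][j] / w_max + (0.0 if community[i] == community[j] else c)
             for j in range(n)]
            for i in range(n)]


def ranked_answer(W, i):
    """Answer to query i: all other songs, most similar first."""
    return sorted((j for j in range(len(W)) if j != i),
                  key=lambda j: W[i][j])


# Hypothetical data: songs 0 and 1 were detected as one community.
W = [[0.0, 0.4, 0.2],
     [0.4, 0.0, 0.8],
     [0.2, 0.8, 0.0]]
community = [0, 0, 1]

W_refined = refine_dissimilarities(W, community)
print(ranked_answer(W, 0), ranked_answer(W_refined, 0))
```

In this toy case the original answer to query 0 ranks song 2 first, whereas the refined answer promotes song 1, the member of the same detected community.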

4.2. Evaluation methodology

A common measure to evaluate query-by- example system s is th e mean 



of average precisions (MAP) over all queries (jManning et al.l . 120081 ). which 
we denote as (P). To calculate such a measure, one averages across each of 
the answers Ai to queries Si, Ai being an ascendingly ordered list according 
to the rows of W (or W, depending on which solution we evaluate). More 
concretely, the average precision Pi for a query song Si is calculated from the 
retrieved answer Aj as 



$$P_i = \frac{1}{C - 1} \sum_{r=1}^{N-1} P_i(r)\, I_i(r), \qquad (6)$$

where $P_i(r)$ is the precision of the sorted list $A_i$ at rank $r$,

$$P_i(r) = \frac{1}{r} \sum_{l=1}^{r} I_i(l), \qquad (7)$$

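and $I_i$ is a relevance function such that $I_i(z) = 1$ if the song with rank
$z$ in $A_i$ is a cover of $s_i$, and $I_i(z) = 0$ otherwise. We then define the
relative MAP increase as

$$\Delta = 100 \left( \frac{\langle \hat{P} \rangle}{\langle P \rangle} - 1 \right), \qquad (8)$$

where $\langle \hat{P} \rangle$ and $\langle P \rangle$ are the MAPs obtained
with $\hat{W}$ and $W$, respectively.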
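The evaluation measures can be sketched as follows. This is a hedged illustration rather than the evaluation code used here. It assumes that the average precision is normalized by the number of covers of the query (taken to be C - 1) and that the relative increase is the percentage change between the two MAP values. That reading reproduces the MIREX figures reported later in the text (0.66 and 0.75 giving 13.64). All function names are hypothetical.

```python
def average_precision(relevance, n_relevant):
    """Average precision of one ranked answer.

    relevance[r - 1] is 1 if the item at rank r is a cover of the query and
    0 otherwise (the relevance function I_i). n_relevant is the number of
    covers of the query in the collection (assumed here to be C - 1).
    """
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this rank, counted at hits only
    return total / n_relevant


def mean_average_precision(relevance_lists, n_relevant):
    """Mean of the average precisions over all queries."""
    aps = [average_precision(rel, n_relevant) for rel in relevance_lists]
    return sum(aps) / len(aps)


def relative_increase(map_original, map_refined):
    """Relative MAP increase, in percent."""
    return 100.0 * (map_refined / map_original - 1.0)


# MAP values reported in the text for the two MIREX submissions.
delta_mirex = relative_increase(0.66, 0.75)
```

Rounded to two decimals, `delta_mirex` agrees with the relative increment of 13.64 quoted for the MIREX submissions.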




4.3. Results

To assess the algorithms' accuracy we independently optimized the
parameters for each algorithm as explained in the previous section. However, we
now try to maximize $\langle P \rangle$ instead of $F$. We notice that these new
thresholds can be different from the ones used in Sec. 3, therefore implying
that the best performing methods of Sec. 3 will not necessarily yield the
highest increments $\Delta$. In particular, clustering and community detection
algorithms giving better community detection and more suitable false positives
will achieve the highest increments. Thus, due to the definition of $\hat{W}$
[Eq. (5)], the role of false positives becomes important. Furthermore, due to
the use of different evaluation metrics, small changes in the optimal
parameters might be necessary.

To illustrate the above reasoning regarding false positives, consider the
following example. Suppose the first items of the ranked answer to a given
query $s_i$ are $A_i = \{s_j, s_k^*, s_l, s_m, \ldots\}$, where $^*$ indicates
effective (real) membership to the same cover song group. Now suppose that
clustering algorithm CA1 selects songs $s_i$, $s_j$, and $s_k$ as belonging to
the same cluster. In addition, suppose that clustering algorithm CA2 selects
$s_i$, $s_k$, $s_l$, and $s_m$. Both clustering algorithms would have the same
recall $R$ but CA1 will have a higher precision $P$, and therefore a higher
accuracy value $F$ (see Sec. 3). Then, by Eq. (5), the refined answer for CA1
becomes $\hat{A}_i^1 = \{s_j, s_k^*, s_l, s_m, \ldots\}$, the same as $A_i$. On
the other hand, the refined answer for CA2 becomes
$\hat{A}_i^2 = \{s_k^*, s_l, s_m, s_j, \ldots\}$. This implies that, when
evaluating the relative accuracy increment $\Delta$ [Eqs. (6)-(8)], CA2 will
take a higher MAP value than CA1, since $s_k^*$ is ranked before $s_j$ in
$\hat{A}_i^2$. Therefore, with regard to relative increments $\Delta$, and in
contrast to accuracy $F$, CA1 will not improve the result, while CA2 will.
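The example can be reproduced numerically with a short sketch. It assumes that $s_k$ is the only true cover of the query among the listed songs, that the query's dissimilarities are the hypothetical values given below, and that c = 2. For brevity the normalization uses the largest dissimilarity of the query row instead of the maximum of the whole matrix. The function name `rerank` is hypothetical.

```python
def rerank(dissim_to_query, same_cluster, c=2.0):
    """Rank songs by refined dissimilarity to the query (sketch of Eq. 5)."""
    w_max = max(dissim_to_query.values())
    refined = {
        song: w / w_max + (0.0 if song in same_cluster else c)
        for song, w in dissim_to_query.items()
    }
    return sorted(refined, key=refined.get)


# Hypothetical dissimilarities reproducing the original order s_j, s_k, s_l, s_m.
dissim = {"s_j": 0.1, "s_k": 0.2, "s_l": 0.3, "s_m": 0.4}

answer_ca1 = rerank(dissim, {"s_j", "s_k"})         # CA1 groups s_j and s_k with the query
answer_ca2 = rerank(dissim, {"s_k", "s_l", "s_m"})  # CA2 groups s_k, s_l and s_m with it

# With a single true cover, the average precision is the reciprocal of its rank.
ap_ca1 = 1.0 / (answer_ca1.index("s_k") + 1)
ap_ca2 = 1.0 / (answer_ca2.index("s_k") + 1)
```

Under these assumptions CA1 leaves the true cover at rank 2 (average precision 0.5), whereas CA2 moves it to rank 1 (average precision 1.0), even though CA2 contains more false positives.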
Algorithm   Setup
            2.1     2.2     2.3     2.4     3
KM          2.26    2.40    2.06    2.29    n.c.
SL          2.26    2.40    1.16    2.29    2.05
CL          1.93    1.19    1.43    1.10    1.28
UPGMA       5.87    5.22    3.96    3.49    4.37
WPGMA       4.91    3.58    3.83    2.67    3.60
MO          6.84    5.37    5.14    2.94    5.54
PM1         6.15    5.70    4.95    3.28    5.49
PM2         5.98    4.85    n.c.    n.c.    n.c.
PM3         6.05    5.10    3.81    2.97    4.73

Table 3: Relative MAP increase $\Delta$ for the considered setups (see Table 1 for the details
on the different setups). Due to the algorithms' complexity, some results were not computed
(denoted as n.c.).



We report the out-of-sample accuracy increments $\Delta$ for setups 2.1 to 3
in Table 3. Overall, these are between 3% and 5% for UPGMA, WPGMA,
MO, and all PMs, with some of them reaching 6%. We see that, in general,
methods based on complex networks perform better, especially MO and PM1.
We also see that the inclusion of "noise songs" ($N_N = 400$, setups 2.3 and
2.4) lowers the increments of nearly all algorithms (with the exception of
poorly performing ones).

A further out-of-sample test was done within the MIREX audio cover
song identification contest. The MIR evaluation exchange (MIREX) is an
international community-based framework for the formal evaluation of MIR
systems and algorithms (Downie, 2008). Among other tasks, MIREX allows
for an objective assessment of the accuracy of different cover song
identification algorithms. For that purpose, participants can submit their
algorithms as binary executables (i.e. as a black box, without disclosing any
details), and the MIREX organizers determine and publish the algorithms'
accuracies and runtimes. The underlying music collections are never published
or disclosed to the participants, either before or after the contest.
Therefore, participants cannot tune their algorithms to the music collections
used in the evaluation process. In the editions of 2008 and 2009 we submitted
the same two versions of our system and obtained the two highest accuracies
achieved to date† (Serra et al., 2009c). The first version of the system
(submitted to both editions) corresponded to the $Q_{\max}$ measure alone, while
the second version (also submitted to both editions) comprised $Q_{\max}$ plus
PM1‡ and the dissimilarity update of Eq. (5). The MAP $\langle P \rangle$
achieved with the former was 0.66, while with the latter it was 0.75. This
corresponds to a relative increment $\Delta = 13.64$, which is substantially
higher than the ones achieved here with our data, most probably because the
setup for the MIREX task is $N_q = 30$, $C = 11$, and $N_N = 0$. Such a setup
might capitalize on the effects that community detection can have in improving
the accuracy. In particular, the techniques presented here have greater
potential for increasing the final accuracies when high cardinalities are
considered.



5. The role of the original song within a cover song community 

From a music perception and cognition point of view, a musical work or
song can be considered as a category (Zbikowski, 2002). Categories are one of
the basic devices to represent knowledge, either by humans or by machines
(Rogers and McClelland, 2004). According to existing empirical evidence,
some authors postulate that our brain builds categories around prototypes,
which encapsulate the statistically most prevalent category features, and
against which potential category members are compared (Rosch and Mervis,
1975). Under this view, after listening to several cover songs, a prototype
for the underlying musical piece would be abstracted by listeners. This
prototype might encapsulate features like the presence of certain motives,
chord progressions, or contrasts among different musical elements. In this
scenario, new items will then be judged in relation to the prototype, forming
gradients of category membership (Rosch and Mervis, 1975).

In the context of cover song communities, we hypothesize that these
gradients of category membership, in a majority of cases, might point to the
original song, i.e. the one which was first released. In particular we
conjecture that, in one way or another, all cover songs inherit some
characteristics from this "original prototype". This feature, combined with the
fact that new versions might as well be inspired by other covers, leads us to
infer that the original song occupies a central position within a cover song
community, being a referential or "best example" of it (Serra et al., 2009b).

† The results for 2008 and 2009 are available from http://music-ir.org/mirex/2008
and http://music-ir.org/mirex/2009, respectively. We did not participate in the 2010
edition because the MIREX evaluation dataset was kept the same and we did not have
any new algorithm to submit.

‡ We just submitted PM1 because it was the only algorithm we had available at that
time.



To evaluate this hypothesis we manually check for original versions in
setup 3 and discard the sets that do not have an original, i.e. the ones where
the oldest song was not performed by the original artist. Here we make an
oversimplification and assume that the most well-known (or popular) version
of a song is the original one. This allows us to objectively "mark" our cover
songs with a label stating whether they are actually the original version, thus
avoiding subjective judgments about a song's popularity with regard to
its covers. Following this criterion, we find 426 originals out of 523 cover sets.
Throughout this section, we employ the directed weighted graph defined by the
asymmetric matrix $W$ (Secs. 2.1 and 2.2).

Initial supporting evidence that the original song is central within its
community is given by Figs. 7 and 8. In Fig. 7 we depict the resulting
network after the application of a strong threshold (only using $w_{i,j} < 0.1$).
We see that communities are well defined and also that many of the original
songs are "the center" of their communities. In Fig. 8, two cumulative
distributions have been calculated: one for the weights of links exiting an
original song (performed by the original artist, black solid line), and one for
links exiting covers (performed by the original artist or another one after the
original recording was made, blue dashed line). The plot of these cumulative
distributions indicates that original songs tend to be connected to other nodes
through links with smaller weights, that is, lower dissimilarities.
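The comparison of the two cumulative distributions can be sketched as below. The matrix, the choice of which node is the original, and the function names are toy assumptions made only to show the computation. They are not data from the collection.

```python
def outgoing_weights(W, sources):
    """Weights of all links leaving the given source nodes (self-links skipped)."""
    n = len(W)
    return [W[i][j] for i in sources for j in range(n) if j != i]


def empirical_cdf(weights, x):
    """Fraction of link weights that are lower than or equal to x."""
    return sum(1 for w in weights if w <= x) / len(weights)


# Toy directed dissimilarities (hypothetical): node 0 is the original song,
# nodes 1 and 2 are covers.
W = [[0.0, 0.1, 0.2],
     [0.3, 0.0, 0.6],
     [0.4, 0.7, 0.0]]

from_originals = outgoing_weights(W, [0])
from_covers = outgoing_weights(W, [1, 2])

cdf_originals = empirical_cdf(from_originals, 0.25)
cdf_covers = empirical_cdf(from_covers, 0.25)
```

In this toy case all links leaving the original fall below 0.25 while none of the links leaving covers do, which is the kind of separation between the two curves that the plot is meant to reveal.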

To evaluate the aforementioned hypothesis in a more formal way, we
propose a study of the ability to automatically detect the original version within
a community of covers. To this end, we consider an "ideal" community
detection algorithm (i.e. an algorithm detecting cover song communities with
no false positives and no false negatives) and propose two different
methods. These methods are based on the structure of weights of the obtained
sub-network after the ideal community detection algorithm has been applied.



Figure 7: Graphical representation of the cover song network with a threshold of 0.1.
Original songs are drawn in blue, while covers are in black.

Closeness centrality: This algorithm estimates the centrality (Boccaletti et al.,
    2006; Barrat et al., 2004) of a node by calculating the mean path length
    between that node and any other node in the sub-network. Note that
    the sub-network is fully connected, as no threshold has been applied
    in this phase. Therefore, the shortest path is usually the direct one.
    Mathematically, let $W^{(k)}$ be the sub-network containing the $k$-th cover
    song community. Then the index $l$ of the original (or prototype) song
    $s_l^{(k)}$ of the $k$-th community corresponds to

    $$\arg\min_{1 \le i \le C^{(k)}} \sum_{j=1}^{C^{(k)}} w_{i,j}^{(k)}, \qquad (9)$$

    where $C^{(k)}$ is the cardinality of the $k$-th cover song community. Notice
    that a similar methodology is employed in the clustering context to infer
    the medoid of a cluster (Xu and Wunsch II, 2009; Jain et al., 1999).



MST centrality: In this second algorithm we reinforce the role of central
    nodes. First, we calculate the minimum spanning tree (MST) for the
    sub-network under analysis. After that, we apply the previously
    described closeness centrality [Eq. (9)] to the resulting graph.

Figure 8: Cumulative weight distributions for links in the network, divided between links
outgoing from an original song (black solid line) and from a cover song (blue dashed line).
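The two centrality-based detectors described above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the code used for the experiments. The closeness variant uses the sum of direct outgoing weights, which has the same minimizer as the mean. The MST variant symmetrizes the directed weights by averaging both directions, which is an assumption of this example because the text only states that the MST is undirected. All function names and the toy matrix are hypothetical.

```python
def closeness_original(W):
    """Index of the node with the smallest total outgoing weight (cf. Eq. 9)."""
    n = len(W)
    totals = [sum(W[i][j] for j in range(n) if j != i) for i in range(n)]
    return min(range(n), key=totals.__getitem__)


def minimum_spanning_tree(S):
    """Prim's algorithm on a symmetric matrix S; returns an adjacency dict."""
    n = len(S)
    tree = {i: {} for i in range(n)}
    in_tree = {0}
    while len(in_tree) < n:
        # Cheapest link joining the current tree to a node outside it.
        i, j = min(
            ((a, b) for a in in_tree for b in range(n) if b not in in_tree),
            key=lambda e: S[e[0]][e[1]],
        )
        tree[i][j] = tree[j][i] = S[i][j]
        in_tree.add(j)
    return tree


def tree_distances(tree, source):
    """Path lengths from source to every node, following tree links only."""
    dist = {source: 0.0}
    stack = [source]
    while stack:
        u = stack.pop()
        for v, w in tree[u].items():
            if v not in dist:
                dist[v] = dist[u] + w
                stack.append(v)
    return dist


def mst_original(W):
    """Closeness centrality applied to the MST of the sub-network."""
    n = len(W)
    # Assumption: make the graph undirected by averaging both directions.
    S = [[(W[i][j] + W[j][i]) / 2.0 for j in range(n)] for i in range(n)]
    tree = minimum_spanning_tree(S)
    totals = [sum(tree_distances(tree, i).values()) for i in range(n)]
    return min(range(n), key=totals.__getitem__)


# Toy community (hypothetical): node 0 is close to every other node.
W = [[0.0, 0.1, 0.2, 0.2],
     [0.2, 0.0, 0.6, 0.7],
     [0.3, 0.5, 0.0, 0.8],
     [0.3, 0.6, 0.9, 0.0]]

closeness_pick = closeness_original(W)
mst_pick = mst_original(W)
```

For this toy community both detectors select node 0. For a community of two nodes the tree consists of a single link, so both nodes obtain the same total and the MST variant cannot discriminate between them.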

The results in Table 4 show the percentage of hits and misses for the
detection of original songs as a function of the cardinality of the considered
cover song community. We report results for $C$ between 2 and 7 (the
cardinalities for which our music collection has a representative number of
communities $N_q$). The percentage of hits and misses can be compared to the
null hypothesis of randomly selecting one song in the community.

We observe that, in general, accuracies are around 50% and, in some cases,
they reach values of 60%. An accuracy of exactly 50% is obtained with $C = 2$
by both the null hypothesis and the MST centrality algorithm. This is
because the MST is defined as undirected, and there is no way to discriminate
the original song in a sub-network of two nodes. As soon as $C > 2$, accuracies
become greater than the null hypothesis and statistical significance arises.
Statistical significance is assessed with the binomial test (Kvam and Vidakovic,
2007).

Algorithm              C
                       2        3        4        5        6        7
Closeness centrality   59.4**   53.6**   43.1*    60.5**   48.0**   27.2
MST centrality         50.0     52.4**   60.7**   52.6**   48.0**   63.6**
Null hypothesis        50.0     33.3     25.0     20.0     16.7     14.3
N_q                    190      82       51       38       25       11

Table 4: Percentage of hits and misses for the original song detection task depending on
the cardinality $C$ of the cover song communities. The * and ** symbols denote statistical
significance at p < 0.05 and p < 0.01, respectively. The last line shows $N_q$, i.e. the number
of communities for each cardinality.
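A binomial test of this kind can be sketched as follows. The example takes the MST centrality result for C = 3 (52.4% of 82 communities, rounded here to 43 hits) and compares it with the null hypothesis of picking one of the three songs at random. Using a one-sided test and recovering the hit count by rounding the percentage are assumptions of this sketch, since the text only states that the binomial test was used.

```python
from math import comb


def binomial_tail(hits, n, p):
    """P(X >= hits) for X following a Binomial(n, p) distribution."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(hits, n + 1))


n_communities = 82                    # communities with C = 3 (Table 4)
hits = round(0.524 * n_communities)   # hit count recovered from the percentage
p_null = 1.0 / 3.0                    # random choice among C = 3 songs

p_value = binomial_tail(hits, n_communities, p_null)
significant_at_001 = p_value < 0.01
```

Under these assumptions the tail probability falls below 0.01, in line with the ** mark shown for that cell of Table 4.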



With this experiment we show that the original song tends to occupy a
central position within its group and, therefore, that a measure of centrality
can be used to discriminate it from a group of covers. The same concepts
of centrality may be valid for alternative dissimilarity measures representing
musical aspects such as timbre, rhythm, or structure (cf. Downie, 2008;
Casey et al., 2008). Thus, one could think of incorporating information from
these other aspects of the audio content in order to improve the accuracy of
the task. A more complicated, if not impossible, task would be to detect the
original song on a pairwise basis. In this respect, works on modeling court
decisions like the ones from Müllensiefen and Pendzich (2009) come closer.
In general, for detecting original songs, information coming from the audio
content alone may be insufficient. Essential temporal aspects (in a historical
sense) are absent in such information and, for incorporating them, we should
gather data from cultural and editorial sources. It goes without saying
that, probably, high accuracies are unreachable and, more importantly, that
the concept of originality is a very particular one, placed in a specific cultural
context and epoch. Indeed, the digital revolution of the last years is beginning
to question such a concept (Fitzpatrick, 2009).



6. Conclusions 

In this article we built and analyzed a musical network reflecting cover
song communities, where nodes corresponded to different audio recordings
and links between them represented a measure of resemblance between their
musical content. In addition, we analyzed the possibility of using such a
network to apply different community detection algorithms to detect
coherent groups of cover songs. Three versions of such algorithms were proposed.
These algorithms achieved accuracies comparable to those of existing
state-of-the-art methods, with similar or even faster computation times.
Furthermore, we provided evidence that the knowledge acquired through
community detection is valuable in improving the raw results of a query-based
cover song identification system. Finally, we discussed a particular outcome
from considering cover song communities, namely the analysis of the role of
the original song within its covers. We showed that the original song tends
to occupy a central position within its group and, therefore, that a
measure of centrality can be used to discriminate original from cover songs when
the sub-network of these communities is considered. To the best of the authors'
knowledge, the present work is the first attempt made in this direction.

In the light of these results, complex networks stand as a promising
research line within the specific task of cover detection; but, at the same time,
the proposed approach can be applied to any query-by-example IR system
(Baeza-Yates and Ribeiro-Neto, 1999; Manning et al., 2008), and especially
to other query-by-example MIR systems (Downie, 2008; Casey et al., 2008)
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